NLP Project - Surendiran Rangaraj

Date : 12/1/2021

Illinois is famous for being one of the very few states in the country with negative population growth. The objective of your final project is to:

1. Identify the key reasons for the declining population (what people like / dislike about Chicago / suburbs) by extracting meaningful insights from unstructured text
2. Provide actionable recommendations on what can be done to reverse this trend (how to make Chicago / suburbs more attractive)

You have access to a collection of ~200K news articles (about 500 MB). The news articles are related to either Chicago and / or Illinois and you can access them in the following ways:

. Download the data by following this link from your browser: https://storage.googleapis.com/msca-bdp-data-open/news/news_final_project.json
. Use Spark on GCP: news_final_project = spark.read.parquet('gs://msca-bdp-data-open/news_final_project')
. Use Pandas from anywhere (your laptop, Colab or any cloud): df_news_final_project = pd.read_json('https://storage.googleapis.com/msca-bdp-data-open/news/news_final_project.json', orient='records', lines=True)


To complete your assignment, I suggest considering the following steps:

. Clean-up the noise (eliminate articles irrelevant to the analysis)
. Detect major topics
. Identify top reasons for population decline (negative sentiment)
    . Suggest corrective actions
    . Plot a timeline to illustrate how the sentiment is changing over time
. Demonstrate how the city / state can attract new businesses (positive sentiment)
. Leverage appropriate NLP techniques to identify organizations and people and apply targeted sentiment
    . Why should businesses stay in IL or move into IL?
        . Create an appropriate visualization to summarize your recommendations (e.g. word cloud chart or bubble chart)
    . Why should residents stay in IL or move into IL?
        . Create an appropriate visualization to summarize your recommendations (e.g. word cloud chart or bubble chart)

Additional guidance:

. The default sentiment from any software package will likely be wrong and will require significant tweaking
    . Either keyword / dictionary approach or
    . Labeling and classification
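
The keyword / dictionary approach mentioned above could be sketched roughly as follows. This is a minimal illustration, not the method used in this notebook: the lexicon entries, their weights, the helper names lexicon_sentiment and label, and the 0.25 threshold are all hypothetical choices made for the example.

```python
import re

# Hypothetical domain lexicon: these words and weights are illustrative
# assumptions, not taken from the assignment or from any library.
DOMAIN_LEXICON = {
    "growth": 1.0,
    "investment": 1.0,
    "jobs": 0.5,
    "crime": -1.0,
    "tax": -0.5,
    "exodus": -1.0,
}


def lexicon_sentiment(text, lexicon=DOMAIN_LEXICON):
    """Return the mean weight of lexicon words found in text (0.0 if none)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    weights = [lexicon[t] for t in tokens if t in lexicon]
    return sum(weights) / len(weights) if weights else 0.0


def label(score, threshold=0.25):
    """Map a score to a coarse label; the threshold is an assumed cutoff."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


print(label(lexicon_sentiment("New investment brings jobs and growth")))
print(label(lexicon_sentiment("Residents cite crime and tax as reasons for the exodus")))
```

In practice the lexicon would be tuned against a hand-checked sample of articles, since a word such as "tax" is not negative in every context.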

. You are encouraged to explore a combination of several techniques to identify key topics:
    . Topic modeling (e.g. LDA using gensim or ktrain)
    . Classification (hand-label several topics on a sample and then train classifier)
    . Clustering (cluster topics around pre-selected keywords or word vectors)
    . Zero-shot (NLI) modeling
    . Please ensure your PowerPoint presentation (in PPTX or PDF format) is submitted to the course module as-is (not zipped). Otherwise I am unable to use Canvas SpeedGrader.
    . The presentation should look professional – not a collection of screenshots from your analytical software
    . Roughly 8-12 pages is reasonable for this kind of project but there are no strict restrictions.
    . On your slides you will want to provide:
        . Executive Summary
        . Methodology and source data overview
        . Actionable recommendations
        . Apply text summarization algorithms where possible to synthesize your insights
    . Please submit your actual program codes (Jupyter notebooks) along with your PowerPoint
    . The slides should be self-sufficient: after reading them, there should be no need to read the notebook (we still ask you to provide the notebooks as proof of work).
    . The slides should clearly answer all the questions and the answers should be supported with the plots/tables/numbers produced in the notebook based on the actual data.
    . The slides should contain the RIGHT amount of supporting material for each question. Too much supporting material is as bad as too little: with too much, you will lose the audience's attention and the presentation will be a mess; with too little, your statements will not look convincing.
    . Everything should be clear, logical, well organized, and as simple as possible. Use proper English and run a spell check.
    . All the plots should be of production quality and easily readable. Foggy plots, untitled plots, unreadable labels, overlapping labels are unacceptable.
    . If your formatting somehow gets corrupted when you put your slides into Canvas (this sometimes happens), it is your responsibility to fix it. For example, try saving the slides in another format such as PDF or HTML.
    . Any statements you make should be supported by data. Only the recommendations or project goals sections can contain elements not directly supported by the data.
    . Please submit your actual program code (e.g. Python notebook) along with your PowerPoint, as a separate attachment
        . Your presentation should be targeted toward a business audience and must not contain any code snippets
. You are welcome to use any software packages of your choice to complete the assignment

Import Libraries

Read Topic Filter Data

Filter Positive and Negative Sentiment articles to separate Dataframe

Topic Modeling for Filtered data - Positive Sentiment

ktrain

Save the ktrain Topic Models

The get_topic_model function learns a topic model using Latent Dirichlet Allocation (LDA).

Build the Document-Topic matrix

The build method may prune documents based on a threshold. The filter method prunes other lists in the same way that build pruned the documents. This is useful for filtering lists that contain metadata associated with the documents, for use with visualize_documents.
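
The relationship between pruning and filtering can be illustrated with a small stand-alone sketch. It does not call ktrain and is not ktrain's implementation: it assumes the pruning rule is "drop a document when its highest topic probability is below the threshold", and the names build_keep_mask and filter_by_mask, along with the sample data, are hypothetical.

```python
def build_keep_mask(doc_topic_rows, threshold):
    """Keep a document when its strongest topic probability reaches threshold."""
    return [max(row) >= threshold for row in doc_topic_rows]


def filter_by_mask(items, keep_mask):
    """Apply the same keep/drop decisions to a parallel list of metadata."""
    if len(items) != len(keep_mask):
        raise ValueError("items must align one-to-one with the original documents")
    return [item for item, keep in zip(items, keep_mask) if keep]


# Illustrative document-topic rows (one row per document, one column per topic).
doc_topics = [
    [0.70, 0.20, 0.10],  # clearly about topic 0 -> kept
    [0.34, 0.33, 0.33],  # no dominant topic -> pruned
    [0.05, 0.15, 0.80],  # clearly about topic 2 -> kept
]
titles = ["Tax plan", "Mixed news", "New transit line"]

mask = build_keep_mask(doc_topics, threshold=0.5)
kept_titles = filter_by_mask(titles, mask)
print(kept_titles)  # ['Tax plan', 'New transit line']
```

Any list that lines up with the original documents (titles, dates, sentiment labels) has to be filtered with the same mask, otherwise the metadata no longer matches the rows of the document-topic matrix.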

Top-ranked document for topic 18: Change in income tax rate

Top-ranked document for topic 10: Change in income tax rate

Visualize all negative topics

Text Summarization

Sentiment Analysis Over Time
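
One way to prepare a sentiment timeline is to average article-level scores per calendar month. The sketch below assumes each article has already been reduced to a (publication date, sentiment score) pair; the function name monthly_mean_sentiment and the sample records are hypothetical and are not results from this project's data.

```python
from collections import defaultdict
from datetime import date


def monthly_mean_sentiment(records):
    """Average sentiment scores per (year, month), sorted chronologically."""
    buckets = defaultdict(list)
    for published, score in records:
        buckets[(published.year, published.month)].append(score)
    return [
        (f"{year}-{month:02d}", sum(scores) / len(scores))
        for (year, month), scores in sorted(buckets.items())
    ]


# Illustrative records only: (publication date, sentiment score in [-1, 1]).
records = [
    (date(2021, 2, 3), 0.4),
    (date(2021, 1, 5), -0.5),
    (date(2021, 1, 20), -0.1),
]

timeline = monthly_mean_sentiment(records)
for month, mean_score in timeline:
    print(month, round(mean_score, 2))
```

The resulting list of (month, mean score) pairs can be passed to any plotting library to draw the timeline.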

Visualization to summarize the articles

Analysis on specific Topics

Tax, income, business in negative sentiment